Recap¶

What did we study and cover?

  • Python and Git installations
  • Command line and basic Linux commands
  • Installing Python libraries
  • Setting up an environment
  • Reading different types of data
  • Pandas
    • A simple report of the data
    • Filters and operations
    • Data cleaning
    • Joining datasets
    • Concats
  • Exploratory analysis
  • Different data = different types of problems
  • Linear regression
  • Logistic regression

Up next

  • Review of joins and concats
  • Programming tips
  • Linear regression exercise
  • Logistic regression exercise

Joins and concats¶

  • Theoretical example
  • Olist
In [4]:
make_df('ABC', range(3))
Out[4]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
In [5]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
In [6]:
display('df1', 'df2', 'pd.concat([df1, df2])')
Out[6]:

df1

A B
1 A1 B1
2 A2 B2

df2

A B
3 A3 B3
4 A4 B4

pd.concat([df1, df2])

A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4

Pandas preserves the index. Consider the case of two datasets that share the same index, 0-1:

In [7]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  
display('x', 'y', 'pd.concat([x, y])')
Out[7]:

x

A B
0 A0 B0
1 A1 B1

y

A B
0 A2 B2
1 A3 B3

pd.concat([x, y])

A B
0 A0 B0
1 A1 B1
0 A2 B2
1 A3 B3

Options:

  • Ignore the indices
In [8]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
Out[8]:

x

A B
0 A0 B0
1 A1 B1

y

A B
0 A2 B2
1 A3 B3

pd.concat([x, y], ignore_index=True)

A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
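Another option, instead of ignoring the indices: ask concat to fail loudly when index labels collide, with verify_integrity=True (a small sketch):

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

# verify_integrity=True raises instead of silently keeping duplicate labels
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as err:
    print("duplicate index labels:", err)
```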

What happens when a column is missing?

In [9]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
Out[9]:

df5

A B C
1 A1 B1 C1
2 A2 B2 C2

df6

B C D
3 B3 C3 D3
4 B4 C4 D4

pd.concat([df5, df6])

A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
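If you would rather drop the incomplete columns than fill them with NaN, concat accepts join="inner" to keep only the columns shared by every frame (a sketch with the same shapes as above):

```python
import pandas as pd

df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}, index=[1, 2])
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]}, index=[3, 4])

# join="inner" keeps only the intersection of the columns: B and C
result = pd.concat([df5, df6], join="inner")
print(list(result.columns))  # ['B', 'C']
```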
In [10]:
sellers_df = pd.read_csv("data/olist_sellers_dataset.csv")
sellers_df
Out[10]:
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
... ... ... ... ...
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

3095 rows × 4 columns

Take a sample of the dataset with n rows using sample(n):

In [12]:
diez_datos = sellers_df.sample(10)
diez_datos
Out[12]:
seller_id seller_zip_code_prefix seller_city seller_state
1327 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP
1900 344223b2a90784f64136a8a5da012e7f 89117 gaspar SC
593 30829ded4523ab9224b93bc49a62c95f 95360 parai RS
2060 a2b911e9ad40dc7d61c142a4683e2d85 27353 barra mansa RJ
1574 2d9e8987a2ea5cf88dce330249043ff1 15155 jaci SP
1130 e3dd723429d1b23614b7fa85e2c7a853 9810 sao bernardo do campo SP
2377 eed78ac17f7f795a19a709745f00cd4e 18608 botucatu SP
790 bd697db56ff8e7c9cd1abeb7f72029b2 3533 sao paulo SP
1362 8a43128d7f9a3db592b866e6861a6cce 8080 sao paulo SP
2677 49f8d2ea7b718d0ce587d34207caf2b4 9830 sao bernardo do campo SP
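Note that sample(10) draws different rows on every run; for a reproducible sample you can pass random_state (a sketch on a toy frame):

```python
import pandas as pd

df = pd.DataFrame({"v": range(100)})

# Same seed -> same sample every time
s1 = df.sample(10, random_state=42)
s2 = df.sample(10, random_state=42)
print(s1.equals(s2))  # True
```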
In [13]:
cinco_datos = sellers_df.tail(5)
cinco_datos
Out[13]:
seller_id seller_zip_code_prefix seller_city seller_state
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP
In [15]:
pd.concat([diez_datos,cinco_datos])
Out[15]:
seller_id seller_zip_code_prefix seller_city seller_state
1327 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP
1900 344223b2a90784f64136a8a5da012e7f 89117 gaspar SC
593 30829ded4523ab9224b93bc49a62c95f 95360 parai RS
2060 a2b911e9ad40dc7d61c142a4683e2d85 27353 barra mansa RJ
1574 2d9e8987a2ea5cf88dce330249043ff1 15155 jaci SP
1130 e3dd723429d1b23614b7fa85e2c7a853 9810 sao bernardo do campo SP
2377 eed78ac17f7f795a19a709745f00cd4e 18608 botucatu SP
790 bd697db56ff8e7c9cd1abeb7f72029b2 3533 sao paulo SP
1362 8a43128d7f9a3db592b866e6861a6cce 8080 sao paulo SP
2677 49f8d2ea7b718d0ce587d34207caf2b4 9830 sao bernardo do campo SP
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP
In [16]:
pd.concat([diez_datos,cinco_datos], ignore_index=True)
Out[16]:
seller_id seller_zip_code_prefix seller_city seller_state
0 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP
1 344223b2a90784f64136a8a5da012e7f 89117 gaspar SC
2 30829ded4523ab9224b93bc49a62c95f 95360 parai RS
3 a2b911e9ad40dc7d61c142a4683e2d85 27353 barra mansa RJ
4 2d9e8987a2ea5cf88dce330249043ff1 15155 jaci SP
5 e3dd723429d1b23614b7fa85e2c7a853 9810 sao bernardo do campo SP
6 eed78ac17f7f795a19a709745f00cd4e 18608 botucatu SP
7 bd697db56ff8e7c9cd1abeb7f72029b2 3533 sao paulo SP
8 8a43128d7f9a3db592b866e6861a6cce 8080 sao paulo SP
9 49f8d2ea7b718d0ce587d34207caf2b4 9830 sao bernardo do campo SP
10 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
11 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
12 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
13 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
14 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

append in pandas does the same thing (note the FutureWarning below: it is deprecated in favor of pd.concat)

In [17]:
display('df1', 'df2', 'df1.append(df2)')
<string>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
<string>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Out[17]:

df1

A B
1 A1 B1
2 A2 B2

df2

A B
3 A3 B3
4 A4 B4

df1.append(df2)

A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
In [24]:
quince = diez_datos.append(cinco_datos)
quince
/var/folders/hc/jwd08zzs6zg8d4ktgzqptl700000gp/T/ipykernel_5336/4097746820.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  quince = diez_datos.append(cinco_datos)
Out[24]:
seller_id seller_zip_code_prefix seller_city seller_state
1327 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP
1900 344223b2a90784f64136a8a5da012e7f 89117 gaspar SC
593 30829ded4523ab9224b93bc49a62c95f 95360 parai RS
2060 a2b911e9ad40dc7d61c142a4683e2d85 27353 barra mansa RJ
1574 2d9e8987a2ea5cf88dce330249043ff1 15155 jaci SP
1130 e3dd723429d1b23614b7fa85e2c7a853 9810 sao bernardo do campo SP
2377 eed78ac17f7f795a19a709745f00cd4e 18608 botucatu SP
790 bd697db56ff8e7c9cd1abeb7f72029b2 3533 sao paulo SP
1362 8a43128d7f9a3db592b866e6861a6cce 8080 sao paulo SP
2677 49f8d2ea7b718d0ce587d34207caf2b4 9830 sao bernardo do campo SP
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP
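Since append is deprecated (and removed in pandas 2.0), the supported way to get the same result is pd.concat, shown here on small stand-in frames:

```python
import pandas as pd

diez = pd.DataFrame({"a": [1, 2]}, index=[10, 11])
cinco = pd.DataFrame({"a": [3]}, index=[12])

# quince = diez.append(cinco)     # deprecated
quince = pd.concat([diez, cinco])  # supported equivalent
print(list(quince.index))  # [10, 11, 12]
```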

Joins¶

In [20]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
Out[20]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014

merge automatically recognizes that both frames have an employee column, so we don't need to tell it which key to merge on.

In [21]:
df3 = pd.merge(df1, df2)
df3
Out[21]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
In [22]:
geo_df = pd.read_csv("data/olist_geolocation_dataset.csv")
geo_df
Out[22]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1037 -23.545621 -46.639292 sao paulo SP
1 1046 -23.546081 -46.644820 sao paulo SP
2 1046 -23.546129 -46.642951 sao paulo SP
3 1041 -23.544392 -46.639499 sao paulo SP
4 1035 -23.541578 -46.641607 sao paulo SP
... ... ... ... ... ...
1000158 99950 -28.068639 -52.010705 tapejara RS
1000159 99900 -27.877125 -52.224882 getulio vargas RS
1000160 99950 -28.071855 -52.014716 tapejara RS
1000161 99980 -28.388932 -51.846871 david canabarro RS
1000162 99950 -28.070104 -52.018658 tapejara RS

1000163 rows × 5 columns

In [25]:
quince
Out[25]:
seller_id seller_zip_code_prefix seller_city seller_state
1327 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP
1900 344223b2a90784f64136a8a5da012e7f 89117 gaspar SC
593 30829ded4523ab9224b93bc49a62c95f 95360 parai RS
2060 a2b911e9ad40dc7d61c142a4683e2d85 27353 barra mansa RJ
1574 2d9e8987a2ea5cf88dce330249043ff1 15155 jaci SP
1130 e3dd723429d1b23614b7fa85e2c7a853 9810 sao bernardo do campo SP
2377 eed78ac17f7f795a19a709745f00cd4e 18608 botucatu SP
790 bd697db56ff8e7c9cd1abeb7f72029b2 3533 sao paulo SP
1362 8a43128d7f9a3db592b866e6861a6cce 8080 sao paulo SP
2677 49f8d2ea7b718d0ce587d34207caf2b4 9830 sao bernardo do campo SP
3090 98dddbc4601dd4443ca174359b237166 87111 sarandi PR
3091 f8201cab383e484733266d1906e2fdfa 88137 palhoca SC
3092 74871d19219c7d518d0090283e03c137 4650 sao paulo SP
3093 e603cf3fec55f8697c9059638d6c8eb5 96080 pelotas RS
3094 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP

Which column should I merge on?

In [27]:
geo_df.columns
Out[27]:
Index(['geolocation_zip_code_prefix', 'geolocation_lat', 'geolocation_lng',
       'geolocation_city', 'geolocation_state'],
      dtype='object')
In [29]:
pd.merge(quince,geo_df, how="left", left_on = "seller_zip_code_prefix", right_on="geolocation_zip_code_prefix" )
Out[29]:
seller_id seller_zip_code_prefix seller_city seller_state geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.652816 -46.599410 sao paulo SP
1 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.649507 -46.597288 sao paulo SP
2 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.652020 -46.598036 são paulo SP
3 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.645393 -46.593474 sao paulo SP
4 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.647841 -46.596295 são paulo SP
... ... ... ... ... ... ... ... ... ...
1178 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.011130 -45.592347 taubate SP
1179 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.013452 -45.584299 taubaté SP
1180 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.009155 -45.592019 taubate SP
1181 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.009019 -45.584925 taubate SP
1182 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.010990 -45.586938 taubaté SP

1183 rows × 9 columns

In [30]:
pd.merge(quince,geo_df, how="inner", left_on = "seller_zip_code_prefix", right_on="geolocation_zip_code_prefix" )
Out[30]:
seller_id seller_zip_code_prefix seller_city seller_state geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.652816 -46.599410 sao paulo SP
1 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.649507 -46.597288 sao paulo SP
2 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.652020 -46.598036 são paulo SP
3 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.645393 -46.593474 sao paulo SP
4 b18871c324516afa228a13e8f1ef0ae9 4186 sao paulo SP 4186 -23.647841 -46.596295 são paulo SP
... ... ... ... ... ... ... ... ... ...
1178 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.011130 -45.592347 taubate SP
1179 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.013452 -45.584299 taubaté SP
1180 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.009155 -45.592019 taubate SP
1181 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.009019 -45.584925 taubate SP
1182 9e25199f6ef7e7c347120ff175652c3b 12051 taubate SP 12051 -23.010990 -45.586938 taubaté SP

1183 rows × 9 columns
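Both joins return 1183 rows (far more than the 15 sellers) because each zip prefix matches many geolocation rows. On toy frames, indicator=True makes this row multiplication, and any unmatched keys, visible:

```python
import pandas as pd

sellers = pd.DataFrame({"zip": [100, 200], "seller": ["a", "b"]})
geo = pd.DataFrame({"zip": [100, 100, 300], "lat": [1.0, 1.1, 2.0]})

# seller "a" matches twice (two geo rows for zip 100); "b" has no match
merged = pd.merge(sellers, geo, on="zip", how="left", indicator=True)
print(merged["_merge"].value_counts().to_dict())
```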

In [33]:
list(quince.seller_zip_code_prefix.unique())
Out[33]:
[4186,
 89117,
 95360,
 27353,
 15155,
 9810,
 18608,
 3533,
 8080,
 9830,
 87111,
 88137,
 4650,
 96080,
 12051]
In [39]:
list(geo_df.geolocation_zip_code_prefix.unique())
Out[39]:
[1037,
 1046,
 1041,
 1035,
 1012,
 1047,
 1013,
 ...]
In [37]:
geo_df.geolocation_zip_code_prefix.nunique()
Out[37]:
19015
In [41]:
lista_unicos_geo = list(geo_df.geolocation_zip_code_prefix.unique())
In [42]:
quince.seller_zip_code_prefix.isin(lista_unicos_geo).sum()
Out[42]:
15
In [43]:
quince.seller_zip_code_prefix.isin(lista_unicos_geo)
Out[43]:
1327    True
1900    True
593     True
2060    True
1574    True
1130    True
2377    True
790     True
1362    True
2677    True
3090    True
3091    True
3092    True
3093    True
3094    True
Name: seller_zip_code_prefix, dtype: bool
In [46]:
unicos_sellers = list(quince.seller_zip_code_prefix.unique())
geo_df.geolocation_zip_code_prefix.isin(unicos_sellers).sum()
Out[46]:
1183
In [47]:
df_filtrado = geo_df[geo_df.geolocation_zip_code_prefix.isin(unicos_sellers)]
df_filtrado
Out[47]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
58652 3533 -23.543315 -46.521247 sao paulo SP
58659 3533 -23.546867 -46.511526 sao paulo SP
59031 3533 -23.542838 -46.522099 sao paulo SP
59072 3533 -23.545301 -46.514192 são paulo SP
59137 3533 -23.542007 -46.523187 sao paulo SP
... ... ... ... ... ...
977477 96080 -31.753613 -52.315717 pelotas RS
977494 96080 -31.747875 -52.327321 pelotas RS
977528 96080 -31.753442 -52.315974 pelotas RS
977570 96080 -31.749108 -52.321976 pelotas RS
977579 96080 -31.748636 -52.310751 pelotas RS

1183 rows × 5 columns

In [48]:
df_filtrado.groupby("geolocation_zip_code_prefix").agg('nunique')
Out[48]:
geolocation_lat geolocation_lng geolocation_city geolocation_state
geolocation_zip_code_prefix
3533 34 34 2 1
4186 34 34 2 1
4650 42 42 2 1
8080 141 140 2 1
9810 127 128 2 1
9830 37 37 2 1
12051 64 64 2 1
15155 27 27 1 1
18608 162 163 1 1
27353 21 21 1 1
87111 62 62 1 1
88137 27 27 2 1
89117 16 16 1 1
95360 28 28 2 1
96080 84 84 1 1
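Since one prefix maps to many coordinates (which is what multiplied the rows in the merge), a common fix is to collapse geolocation to a single representative row per prefix before joining, e.g. the mean latitude/longitude. A sketch on toy data:

```python
import pandas as pd

geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [3533, 3533, 4186],
    "geolocation_lat": [-23.54, -23.55, -23.65],
    "geolocation_lng": [-46.52, -46.51, -46.60],
})

# One row per prefix: average the coordinates
geo_unique = (geo.groupby("geolocation_zip_code_prefix", as_index=False)
                 [["geolocation_lat", "geolocation_lng"]].mean())
print(len(geo_unique))  # 2
```

Merging against geo_unique then keeps exactly one output row per seller.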

Joins can not only duplicate rows, they can also drop them:

In [50]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
Out[50]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7)

name food drink
0 Mary bread wine
In [51]:
pd.merge(df6, df7, how='inner')
Out[51]:
name food drink
0 Mary bread wine
In [52]:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
Out[52]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7, how='left')

name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
In [53]:
display('df7', 'df6', "pd.merge(df7, df6, how='left')")
Out[53]:

df7

name drink
0 Mary wine
1 Joseph beer

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

pd.merge(df7, df6, how='left')

name drink food
0 Mary wine bread
1 Joseph beer NaN
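The remaining option is how="outer", which keeps the rows from both sides and fills the gaps with NaN, so nothing is dropped:

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"], "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"], "drink": ["wine", "beer"]})

# Outer join: union of the keys; Peter/Paul get NaN drink, Joseph gets NaN food
outer = pd.merge(df6, df7, how="outer")
print(len(outer))  # 4
```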

Programming tips¶

args and kwargs

In [54]:
def funcion_generica(parametro_obligatorio, *args, **kwargs):
    """This is the function docstring"""

    print(parametro_obligatorio)

    if args:
        print(args)
    if kwargs:
        print(kwargs)
In [55]:
funcion_generica(1, 2, 3, variable_no_obligatoria = 5)
1
(2, 3)
{'variable_no_obligatoria': 5}
In [56]:
a_list = [1,2,3]
a_dict = {'bar': 5}

funcion_generica(*a_list, **a_dict)
1
(2, 3)
{'bar': 5}
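A very common use of *args and **kwargs is forwarding: a wrapper accepts any arguments and passes them on to another function unchanged. A hypothetical sketch (both function names are made up):

```python
def add(a, b, scale=1):
    return (a + b) * scale

def logged(*args, **kwargs):
    # Forward everything, positional and keyword, untouched
    result = add(*args, **kwargs)
    print(f"add called with {args} {kwargs} -> {result}")
    return result

logged(1, 2, scale=10)  # prints and returns 30
```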

Classes¶

In [57]:
class Student:
    school = 'UCEMA'

    def __init__(self, name, age):
        self.name = name
        self.age = age
In [59]:
alice = Student('Alicia', 20)
alice.__dict__
Out[59]:
{'name': 'Alicia', 'age': 20}
In [60]:
d = {'a': 1, 'b': 2}
d.update(c=3)

d
Out[60]:
{'a': 1, 'b': 2, 'c': 3}
In [61]:
class Student:
    school = 'UCEMA'

    def __init__(self, name, age, **kwargs):
        self.name = name
        self.age = age

        self.__dict__.update(**kwargs)
In [62]:
alice = Student('Alicia', 20, nationality='francesa')
alice.nationality
Out[62]:
'francesa'
In [64]:
alice.__dict__
Out[64]:
{'name': 'Alicia', 'age': 20, 'nationality': 'francesa'}

Inheritance!

In [65]:
class DataStudent(Student):
    course = 'data'
    # an extra attribute is added in the child class
    def __init__(self, name, age, año, **kwargs):
        super().__init__(name, age, **kwargs)
        self.año = año
In [66]:
DataStudent('Alicia', 20, nationality='francesa').__dict__
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [66], in <cell line: 1>()
----> 1 DataStudent('Alicia', 20, nationality='francesa').__dict__

TypeError: __init__() missing 1 required positional argument: 'año'
In [67]:
DataStudent('Alicia', 20, nationality='francesa', año="2022").__dict__
Out[67]:
{'name': 'Alicia', 'age': 20, 'nationality': 'francesa', 'año': '2022'}
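The TypeError above happens because año is a required positional argument. One way to avoid it is giving the new parameter a default value in the child class (a sketch, reusing the classes defined above):

```python
class Student:
    school = 'UCEMA'

    def __init__(self, name, age, **kwargs):
        self.name = name
        self.age = age
        self.__dict__.update(**kwargs)

class DataStudent(Student):
    course = 'data'

    # año defaults to None, so it can be omitted without raising
    def __init__(self, name, age, año=None, **kwargs):
        super().__init__(name, age, **kwargs)
        self.año = año

print(DataStudent('Alicia', 20, nationality='francesa').__dict__)
```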

Plotly¶

In [68]:
!pip install plotly
Collecting plotly
  Downloading plotly-5.10.0-py2.py3-none-any.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 11.5 MB/s            
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.10.0 tenacity-8.1.0
In [70]:
import plotly.express as px

df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")

fig.show();

Scikit-learn¶

Scikit-learn (sklearn) is a machine learning library that provides tools for data preprocessing, modeling, and model selection.

👉 https://scikit-learn.org

Installing sklearn¶

Several options:

!pip install -U scikit-learn

!conda install scikit-learn

Or go to the PyCharm interpreter settings

👉 Installation documentation https://scikit-learn.org/stable/install.html

Library structure¶

  • sklearn is organized into modules
  • Each module contains tools in the form of classes.

linear_model module¶

  • linear_model is a module
  • LinearRegression is a class

👉 https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

How to import modules and classes¶

There are many ways to import modules and classes in notebooks, but there is one best practice.

In [ ]:
🚫 import sklearn # imports the whole library
model = sklearn.linear_model.LinearRegression() # you have to type the library and module prefix every time
In [ ]:
🚫 import sklearn.linear_model # imports the whole module
model = sklearn.linear_model.LinearRegression() # you still have to type the full prefix every time
In [ ]:
🚫 from sklearn import linear_model # imports the whole module
model = linear_model.LinearRegression() # you have to type the module prefix every time
In [ ]:
🚫 from sklearn.linear_model import * # wildcard import of the whole module
model = LinearRegression() # we can no longer tell where this name comes from

Zen of python¶

In [60]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
In [ ]:
"Explicit is better than implicit" 
✅ from sklearn.linear_model import LinearRegression # explicit class import from module
model = LinearRegression() #=> we know where this object comes from

Logistic regression¶

(figure: logistic regression)

Linear regression recap¶

A linear regression (OLS) maps a linear relationship between the input X and the output y. It finds the optimal slope a and intercept b by minimizing the residuals between the actual y and the predicted y.

(figure: linear regression)
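For the one-feature case, the optimal slope and intercept have a closed form: a = cov(x, y) / var(x) and b = mean(y) − a · mean(x). A minimal NumPy sketch on synthetic data:

```python
import numpy as np

# Synthetic data around y = 3x + 2 with a little noise
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 3 * x + 2 + rng.normal(0, 0.1, size=x.size)

# OLS closed form
a = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - a * x.mean()
print(round(a, 2), round(b, 2))
```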

Linear regression with sklearn¶

In [62]:
import pandas as pd

data = pd.read_csv("house.csv")
data.head()
Out[62]:
Unnamed: 0 Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 0 1 60 RL 65.0 8450 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 1 2 20 RL 80.0 9600 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 86 columns

Let's start simple by modeling SalePrice (y) as a function of GrLivArea (X).

In [63]:
livecode_data = data[['GrLivArea','SalePrice']]

livecode_data.head()
Out[63]:
GrLivArea SalePrice
0 1710 208500
1 1262 181500
2 1786 223500
3 1717 140000
4 2198 250000

Exploration¶

In [65]:
import matplotlib.pyplot as plt

# Plot Living area vs Sale price
plt.scatter(data['GrLivArea'], data['SalePrice'])

# Labels
plt.xlabel("Living area")
plt.ylabel("Sale price")

plt.show()

Training¶

Training a linear regression with LinearRegression

In [66]:
# Import the model
from sklearn.linear_model import LinearRegression

# Instantiate the model (💡 in sklearn often called an "estimator")
model = LinearRegression()

# Define X and y
X = data[['GrLivArea']]
y = data['SalePrice']

# Train the model on the data
model.fit(X, y)
Out[66]:
LinearRegression()

At this stage, the model has learned the optimal parameters (slope a and intercept b) needed to map the relationship between X and y.

Model attributes¶

In [67]:
# View the model's slope (a)
model.coef_ 
Out[67]:
array([105.00927564])
In [68]:
# View the model's intercept (b)
model.intercept_ 
Out[68]:
22104.12101002081

Scoring¶

Every scikit-learn algorithm has a default scoring metric.

LinearRegression uses the coefficient of determination (R²) by default.

R² represents the proportion of the target's variance explained by the features. The score usually ranges between 0 and 1; the higher the score, the better the model.

In [69]:
# Evaluate the model's performance
model.score(X,y)
Out[69]:
0.48960426399689116
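That score can be reproduced from the definition of R², 1 − SS_res / SS_tot (a sketch on toy data rather than the housing set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.2])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - residual sum of squares / total sum of squares
ss_res = ((y - y_pred) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X, y)))  # True
```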

Predict¶

In [70]:
#  Predict on new data
model.predict([[1000]])
/Users/federicomoreno/PycharmProjects/UCEMA/venv/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning:

X does not have valid feature names, but LinearRegression was fitted with feature names

Out[70]:
array([127113.39664561])

It predicts the price of a 1,000-square-foot home, estimating it at roughly 127 thousand dollars.

Sklearn modeling flow¶

  1. Import the model: from sklearn.some_module import SomeModel
  2. Instantiate the model: model = SomeModel()
  3. Train the model: model.fit(X, y)
  4. Evaluate the model: model.score(new_X, new_y)
  5. Make predictions: model.predict(new_X)

Generalization¶

Splitting¶

(figure: holdout split)

In [71]:
from sklearn.model_selection import train_test_split

# split the data into train and test
train_data, test_data = train_test_split(livecode_data, test_size=0.3)

# Ready X's and y's
X_train = train_data[['GrLivArea']]
y_train = train_data['SalePrice']

X_test = test_data[['GrLivArea']]
y_test = test_data['SalePrice']

Version 2¶

In [72]:
# Ready X and y
X = livecode_data[['GrLivArea']]
y = livecode_data['SalePrice']

# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Training¶

In [73]:
# Instantiate the model
model = LinearRegression()

# Train the model on the Training data
model.fit(X_train, y_train)

# Score the model on the Test data
model.score(X_test,y_test)
Out[73]:
0.5384589652653409

(figure: generalization)
